Review of Midterm

Poison distribution question

Midterm answers on OSF site.

Linear Regression

Quick review to go over material from Lecture 8.

Example of multiplicative multiple linear regression

Paruelo and Lauenroth (1996) examined the relationships between the abundance of \(C_3\) plants (those that use \(C_3\) photosynthesis) and geographic and climactic variables. Here we are only going to consider the geographic variables.

Get data and examine.

c3_plants <- read.csv(file = "https://mlammens.github.io/Biostats/data/Logan_Examples/Chapter9/Data/paruelo.csv")
summary(c3_plants)
##        C3              MAP            MAT             JJAMAP      
##  Min.   :0.0000   Min.   : 117   Min.   : 2.000   Min.   :0.1000  
##  1st Qu.:0.0500   1st Qu.: 345   1st Qu.: 6.900   1st Qu.:0.2000  
##  Median :0.2100   Median : 421   Median : 8.500   Median :0.2900  
##  Mean   :0.2714   Mean   : 482   Mean   : 9.999   Mean   :0.2884  
##  3rd Qu.:0.4700   3rd Qu.: 575   3rd Qu.:12.900   3rd Qu.:0.3600  
##  Max.   :0.8900   Max.   :1011   Max.   :21.200   Max.   :0.5100  
##      DJFMAP            LONG            LAT       
##  Min.   :0.1100   Min.   : 93.2   Min.   :29.00  
##  1st Qu.:0.1500   1st Qu.:101.8   1st Qu.:36.83  
##  Median :0.2000   Median :106.5   Median :40.17  
##  Mean   :0.2275   Mean   :106.4   Mean   :40.10  
##  3rd Qu.:0.3100   3rd Qu.:111.8   3rd Qu.:43.95  
##  Max.   :0.4900   Max.   :119.5   Max.   :52.13

Again, we’re only going to consider the geographic variable - latitude and longitude (LONG and LAT).

Have a look at these data in graphical format.

library(ggplot2)
library(GGally)

ggpairs(c3_plants)

C3 abundance is not normally distributed. Let’s convert using a log10(C3 + 0.1) conversion. The + 0.1 is needed because the log of 0 is negative infinity.

Also, we should center the lat and long data.

c3_plants$LAT <- scale(c3_plants$LAT, scale = FALSE)
c3_plants$LONG <- scale(c3_plants$LONG, scale = FALSE)

Make and look at non-interaction model.

c3_plants_lm_noint <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG + LAT)

summary(c3_plants_lm_noint)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG + LAT, data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50434 -0.20010 -0.02084  0.20813  0.43932 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.545622   0.028395 -19.216  < 2e-16 ***
## LONG        -0.003737   0.004464  -0.837    0.405    
## LAT          0.042430   0.005417   7.833 3.71e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2426 on 70 degrees of freedom
## Multiple R-squared:  0.4671, Adjusted R-squared:  0.4519 
## F-statistic: 30.68 on 2 and 70 DF,  p-value: 2.704e-10

Make and look at model with interactions. We can create a model with an interaction term in two ways. They yield identical results.

c3_plants_lm1 <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT)

c3_plants_lm2 <- lm(data = c3_plants, log10(C3 + 0.1) ~ LONG*LAT)

Check out the summary results of both.

summary(c3_plants_lm1)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT, data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5529416  0.0274679 -20.130  < 2e-16 ***
## LONG        -0.0025787  0.0043182  -0.597   0.5523    
## LAT          0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:LAT     0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
summary(c3_plants_lm2)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * LAT, data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.5529416  0.0274679 -20.130  < 2e-16 ***
## LONG        -0.0025787  0.0043182  -0.597   0.5523    
## LAT          0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:LAT     0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11

Interpreting the partial regression coefficients

The partial regression coefficients (i.e., partial slopes) are the slopes of specific predictor variables, holding all other predictor variables constant at their mean values.


Look at diagnostic plots for model with interaction term.

plot(c3_plants_lm1)

Model selection

We want a model that contains enough predictor variables to explain the variation observed, but not one that is over fit. Also, it is important that we not lose focus of the biological questions we are asking. Sometimes, it is best to keep certain predictor variables in a model, even if they are not statistically important, if they are essential to answering our question.

Compare the models without and with an interaction term.

Using anova

Compares the reduction in the residual sums of squares for nested models.

anova(c3_plants_lm_noint,c3_plants_lm1)
## Analysis of Variance Table
## 
## Model 1: log10(C3 + 0.1) ~ LONG + LAT
## Model 2: log10(C3 + 0.1) ~ LONG + LAT + LONG:LAT
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
## 1     70 4.1200                             
## 2     69 3.7595  1   0.36043 6.615 0.01227 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using AIC

Akaike Information Criteria - a relative measure of the information content of a model. Smaller values indicate a more parimonious model. Penalizes models with larger number of predictor variables. As a rule of thumb, differences of greater than 2 (i.e., \(\Delta AIC >2\)) are considered meaningful.

AIC(c3_plants_lm1)
## [1] 0.6350526
AIC(c3_plants_lm_noint)
## [1] 5.318109

The effect of the interaction

Let’s look at the effect of the interaction between LAT and LONG by examining the partial regression coefficient for LONG at different values of LAT. Here we are going to look at the mean latitude value, \(\pm\) 1 and 2 standard deviations.

## mean lat - 2SD 
LAT_sd1 <- mean(c3_plants$LAT)-2*sd(c3_plants$LAT)
c3_plants_LONG.lm1<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd1), data=c3_plants)
summary(c3_plants_LONG.lm1)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd1), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.0662239  0.0674922 -15.798  < 2e-16 ***
## LONG                  -0.0264657  0.0098255  -2.694  0.00887 ** 
## c(LAT - LAT_sd1)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd1)  0.0022522  0.0008757   2.572  0.01227 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
## mean lat - 1SD 
LAT_sd2<-mean(c3_plants$LAT) - 1*sd(c3_plants$LAT)
c3_plants_LONG.lm2<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd2), data=c3_plants)
summary(c3_plants_LONG.lm2)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd2), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.8095827  0.0417093 -19.410  < 2e-16 ***
## LONG                  -0.0145222  0.0060025  -2.419   0.0182 *  
## c(LAT - LAT_sd2)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd2)  0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
## mean lat + 1SD 
LAT_sd4<-mean(c3_plants$LAT) + 1*sd(c3_plants$LAT)
c3_plants_LONG.lm4<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd4), data=c3_plants)
summary(c3_plants_LONG.lm4)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd4), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.2963004  0.0399955  -7.408 2.41e-10 ***
## LONG                   0.0093647  0.0066628   1.406   0.1643    
## c(LAT - LAT_sd4)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd4)  0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11
## mean lat + 2SD 
LAT_sd5<-mean(c3_plants$LAT) + 2*sd(c3_plants$LAT)
c3_plants_LONG.lm5<-lm(log10(C3+.1)~LONG*c(LAT-LAT_sd5), data=c3_plants)
summary(c3_plants_LONG.lm5)
## 
## Call:
## lm(formula = log10(C3 + 0.1) ~ LONG * c(LAT - LAT_sd5), data = c3_plants)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54185 -0.13298 -0.02287  0.16807  0.43410 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.0396593  0.0653846  -0.607   0.5461    
## LONG                   0.0213082  0.0106426   2.002   0.0492 *  
## c(LAT - LAT_sd5)       0.0483954  0.0057047   8.483 2.61e-12 ***
## LONG:c(LAT - LAT_sd5)  0.0022522  0.0008757   2.572   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2334 on 69 degrees of freedom
## Multiple R-squared:  0.5137, Adjusted R-squared:  0.4926 
## F-statistic:  24.3 on 3 and 69 DF,  p-value: 7.657e-11

Note how the partial regression slope for LONG goes from -0.026 to +0.021 and remember this link from two weeks ago: Visualizing Relations in Multiple Regression


Linear Regression with Quadratic Term (Polynomial Regression)

Some times your trend isn’t quite a straight line. One way to deal with this is to add a quadratic term in your regression.

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x^2_{i1} + \epsilon_i \]

Example of polynomial regression

As an example, let’s look at an unpublished data set described in Sokal and Rholf (Biometry, 1997). These data show the frequency of the Lap94 allele in populations of Mytilus edulis and the distance from Southport.

blue mussel

Get the data

blue_mussel <- read.csv(file = "https://mlammens.github.io/Biostats/data/Logan_Examples/Chapter9/Data/mytilus.csv")
summary(blue_mussel)
##       LAP              DIST      
##  Min.   :0.1100   Min.   : 1.00  
##  1st Qu.:0.1600   1st Qu.:18.50  
##  Median :0.3150   Median :42.00  
##  Mean   :0.3171   Mean   :37.32  
##  3rd Qu.:0.4600   3rd Qu.:54.00  
##  Max.   :0.5450   Max.   :67.00

Note that the predictor variable is a frequency, and is bounded by 0 and 1. It is common (though controversial) to use an ArcSine transformation with frequency data. We’ll use the arcsine-square root transformation here (Logan, p. 69).

blue_mussel$asinLAP <- asin(sqrt(blue_mussel$LAP)) * 180/pi

Let’s visualize the data

ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth()

Begin by building a simple linear regression model

blue_mussel_lm <- lm(data = blue_mussel, formula = asinLAP ~ DIST)
summary(blue_mussel_lm)
## 
## Call:
## lm(formula = asinLAP ~ DIST, data = blue_mussel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5594 -3.1252  0.9808  2.7398  6.6314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.08621    2.31643   6.944 4.70e-06 ***
## DIST         0.46731    0.05498   8.500 4.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.431 on 15 degrees of freedom
## Multiple R-squared:  0.8281, Adjusted R-squared:  0.8166 
## F-statistic: 72.24 on 1 and 15 DF,  p-value: 4.052e-07
ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "lm")

Check diagnostics

plot(blue_mussel_lm)

Particularly pay attention to the residuals versus fitted values in the diagnostic plots.

Now, let’s build a polynomial regression model with the addition of a quadratic term

blue_mussel_lm2 <- lm(data = blue_mussel, formula = asinLAP ~ DIST + I(DIST^2))
summary(blue_mussel_lm2)
## 
## Call:
## lm(formula = asinLAP ~ DIST + I(DIST^2), data = blue_mussel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3994 -2.4435 -0.9135  2.8193  7.0820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.268605   3.131423   6.473 1.47e-05 ***
## DIST         0.091456   0.210714   0.434   0.6709    
## I(DIST^2)    0.005547   0.003017   1.839   0.0873 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.116 on 14 degrees of freedom
## Multiple R-squared:  0.8615, Adjusted R-squared:  0.8417 
## F-statistic: 43.54 on 2 and 14 DF,  p-value: 9.773e-07
ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2) )

Is this a better model?

anova(blue_mussel_lm2, blue_mussel_lm)
## Analysis of Variance Table
## 
## Model 1: asinLAP ~ DIST + I(DIST^2)
## Model 2: asinLAP ~ DIST
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     14 237.22                              
## 2     15 294.50 -1   -57.277 3.3803 0.08729 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Marginally.

Let’s look at one more polynomial (cubic).

Challenge

Make the cubic model and test if it is a better fitting model than the linear or quadratic.

blue_mussel_lm3 <- lm(data = blue_mussel, formula = asinLAP ~ DIST + I(DIST^2) + I(DIST^3))
summary(blue_mussel_lm3)
## 
## Call:
## lm(formula = asinLAP ~ DIST + I(DIST^2) + I(DIST^3), data = blue_mussel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1661 -2.1360 -0.3908  1.9016  6.0079 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.2232524  3.4126910   7.684 3.47e-06 ***
## DIST        -0.9440845  0.4220118  -2.237  0.04343 *  
## I(DIST^2)    0.0421452  0.0138001   3.054  0.00923 ** 
## I(DIST^3)   -0.0003502  0.0001299  -2.697  0.01830 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.421 on 13 degrees of freedom
## Multiple R-squared:  0.9112, Adjusted R-squared:  0.8907 
## F-statistic: 44.46 on 3 and 13 DF,  p-value: 4.268e-07
ggplot(data = blue_mussel, aes(x = DIST, y = asinLAP)) +
  geom_point() +
  theme_bw() +
  stat_smooth(method = "lm", formula = y ~ x + I(x^2) + I(x^3) )

And is this model better?

anova(blue_mussel_lm3, blue_mussel_lm2)
## Analysis of Variance Table
## 
## Model 1: asinLAP ~ DIST + I(DIST^2) + I(DIST^3)
## Model 2: asinLAP ~ DIST + I(DIST^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)  
## 1     13 152.12                             
## 2     14 237.22 -1   -85.108 7.2734 0.0183 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(blue_mussel_lm3, blue_mussel_lm)
## Analysis of Variance Table
## 
## Model 1: asinLAP ~ DIST + I(DIST^2) + I(DIST^3)
## Model 2: asinLAP ~ DIST
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     13 152.12                              
## 2     15 294.50 -2   -142.38 6.0842 0.01365 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about AIC?

AIC(blue_mussel_lm)
## [1] 102.729
AIC(blue_mussel_lm2)
## [1] 101.0522
AIC(blue_mussel_lm3)
## [1] 95.49808

We can also check the case of adding yet another polynomial factor, \(x^4\).

AIC(lm(data = blue_mussel, formula = asinLAP ~ DIST + I(DIST^2) + I(DIST^3) + I(DIST^4)))
## [1] 96.11887

Non-linear regression

Species-area relationship curves are often observed to have some sort of Power Law relationship:

Number of species = a constant * Areaz

Peake and Quinn (1993) examined this relationship for invertebrates living in inter-tidal mussel clumps (Logan - Example 9G; Quinn and Keough - Box 6.11).

Get the data.

sar <- read.csv(file = "https://mlammens.github.io/Biostats/data/Logan_Examples/Chapter9/Data/peake.csv")
head(sar)
##      AREA SPECIES INDIV
## 1  516.00       3    18
## 2  469.06       7    60
## 3  462.25       6    57
## 4  938.60       8   100
## 5 1357.15      10    48
## 6 1773.66       9   118
summary(sar)
##       AREA            SPECIES       INDIV       
##  Min.   :  462.2   Min.   : 3   Min.   :  18.0  
##  1st Qu.: 1773.7   1st Qu.:10   1st Qu.: 148.0  
##  Median : 4451.7   Median :14   Median : 338.0  
##  Mean   : 7802.0   Mean   :15   Mean   : 446.9  
##  3rd Qu.: 9287.7   3rd Qu.:21   3rd Qu.: 632.0  
##  Max.   :27144.0   Max.   :25   Max.   :1402.0

Visualize these data.

ggplot(data = sar, aes(x = AREA, y = SPECIES)) +
  geom_point() +
  theme_bw() +
  stat_smooth()

Naively fit a linear model to these data.

sar_lm <- lm(data = sar, SPECIES ~ AREA)
plot(sar_lm)

Now let’s use nls to fit a power law model.

sar_nls <- nls(data = sar, formula = SPECIES ~ const * AREA^z, 
               start = list(const = 0.1, z = 1))
summary(sar_nls)
## 
## Formula: SPECIES ~ const * AREA^z
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## const   0.8584     0.2769   3.100  0.00505 ** 
## z       0.3336     0.0350   9.532 1.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.733 on 23 degrees of freedom
## 
## Number of iterations to convergence: 17 
## Achieved convergence tolerance: 1.037e-06

Make a residuals versus fitted plot for these data

plot(resid(sar_nls) ~ fitted(sar_nls))

Compare the linear and non-linear models using AIC.

AIC(sar_lm)
## [1] 141.0756
AIC(sar_nls)
## [1] 125.1312

Briefly review Table 9.1 on page 212 of Logan.

Introduction to Analysis of Variance - ANOVA

While learning about regression, we’ve been working with both continuous predictor and response variables. But in biology there are many times when we are working with categorical predictor variables and continuous responses. For example, let’s consider a fertilizer addition experiment, where the growth of some plant species is measured under a fertilizer versus no-fertilizer treatment. Here, the predictor variable (fertilizer addition or not) is a categorical variable.

Challenge

We have worked with this kind of simple case, where we have one-predictor variable which can take two values. What did we do in this case?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(iris)
t.test( x = filter(iris, Species == "setosa")$Petal.Length,
        y = filter(iris, Species == "versicolor")$Petal.Length )
## 
##  Welch Two Sample t-test
## 
## data:  filter(iris, Species == "setosa")$Petal.Length and filter(iris, Species == "versicolor")$Petal.Length
## t = -39.493, df = 62.14, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.939618 -2.656382
## sample estimates:
## mean of x mean of y 
##     1.462     4.260
anova(lm(data = filter(iris, Species != "virginica"), formula = Petal.Length ~ Species))
## Analysis of Variance Table
## 
## Response: Petal.Length
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Species    1 195.720 195.720  1559.7 < 2.2e-16 ***
## Residuals 98  12.298   0.125                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we have more than two values within a single predictor variable, then we cannot use a t-test. Here we go to an ANOVA.

anova(lm(data = iris, formula = Petal.Length ~ Species))
## Analysis of Variance Table
## 
## Response: Petal.Length
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Species     2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals 147  27.22   0.185                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visual understanding of ANOVA

First, recall the visual understanding of linear regression:

Logan - Figure 8.3

Now, what would this look like if the x value (i.e., the predictor) was categorical?

Logan - Figure 10.2

Fixed versus Random Effects

Fixed factors are ones we have manipulated, or that we expect a priori to have an effect on the response. From Logan, p 254 - “conclusions about the effects of a fixed factor are restricted to the specific treatment levels investigated …”

Random factors are, as per Logan, p. 254, “randomly chosen from all the possible levels of populations and are used as random representatives of the populations”. E.g., density of plants could be random. Or position of camera traps on a hillside.

Null hypotheses

Fixed effects model:

\[ H_0: \mu_1 = \mu_2 = ... = \mu_i = \mu \]

Random effects model:

\[ H_0: \sigma^2_1 = \sigma^2_2 = ... = \sigma^2_i = \sigma^2 \]